Problem statement (Term Deposit Sale)¶Goal:¶Using the data collected from existing customers, build a model that will help the marketing team identify potential customers who are relatively more likely to subscribe to a term deposit, and thus increase their hit ratio.
Resources Available:¶The historical data for this project is available in the file bank-full.csv.
Deliverable – 1 (Exploratory data quality report reflecting the following) – (20):¶1. Univariate analysis (12 marks)
a. Univariate analysis – data types and description of the independent attributes, which should include: name, meaning, range of values observed, central values (mean and median), standard deviation, quartiles, analysis of the body and tails of the distributions, missing values, and outliers.
b. Strategies to address the different data challenges, such as data pollution, outlier treatment, and missing-value treatment.
c. Please provide comments in the Jupyter notebook regarding the steps you take and the insights drawn from the plots.
2. Multivariate analysis (8 marks)
a. Bi-variate analysis between the predictor variables and the target column. Comment on your findings in terms of their relationship and degree of relation, if any. Visualize the analysis using box plots, pair plots, histograms, or density curves. Select the most appropriate attributes.
b. Please provide comments in the Jupyter notebook regarding the steps you take and the insights drawn from the plots.
Deliverable – 2 (Prepare the data for analytics) – (10)¶Ensure the attribute types are correct. If not, take appropriate actions.
Get the data model ready.
Transform the data, i.e., scale/normalize if required.
Create the training set and test set in ratio of 70:30
Deliverable – 3 (create the ensemble model) – (30)¶First create models using Logistic Regression and the Decision Tree algorithm. Note the model performance using different metrics. Use a confusion matrix to evaluate class-level metrics, i.e., precision/recall. Also report the accuracy and F1 score of the model. (10 marks)
Build the ensemble models (Bagging and Boosting) and note the model performance using the same metrics as for the models above. (at least 3 algorithms) (15 marks)
Make a DataFrame to compare the models and their metrics. Give a conclusion regarding the best algorithm and your reasoning behind it. (5 marks)
Attribute Information:¶Input variables:
Bank client data:
Related to previous contact:
Other attributes:
Output variable (desired target):
Index:¶Deliverable – 1 (Exploratory Data Quality Report) ¶#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn import metrics
from os import system
from sklearn.linear_model import LogisticRegression
from IPython.display import Image
from sklearn.metrics import recall_score, precision_score, f1_score, roc_auc_score,accuracy_score
#Read Data
pdata = pd.read_csv('bank-full.csv')
pdata.head(10)
Observation: No nulls observed. However, many 'unknown' values were found. Most of these 'unknown' values can be considered missing values.
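Since the 'unknown' entries are really disguised missing values, it helps to count them per column before deciding on a treatment. A minimal sketch, using a toy frame standing in for pdata (the column names here mirror the bank data, but the values are illustrative):

```python
import pandas as pd

# Toy stand-in for pdata; the real data has 'unknown' in several object columns
demo = pd.DataFrame({
    "job": ["admin.", "unknown", "technician", "unknown"],
    "education": ["primary", "secondary", "unknown", "tertiary"],
    "age": [30, 45, 52, 28],
})

# Count 'unknown' entries per string column; numeric columns are skipped
unknown_counts = demo.select_dtypes(include="object").eq("unknown").sum()
print(unknown_counts)
```

The same one-liner applied to pdata gives the per-column 'unknown' totals referred to above.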
1. Univariate Analysis:¶#Statistical Analysis
print(pdata.shape)
pdata.info()
pdata.describe().T
#Outlier Analysis: Method used: IQR
percentile_25 = {}
percentile_75 = {}
iqr = {}
lower_bound = {}
upper_bound = {}
for i in pdata:
    if pdata[i].dtypes == "int64":
        percentile_25[i] = np.percentile(pdata[i], 25)
        percentile_75[i] = np.percentile(pdata[i], 75)
        iqr[i] = percentile_75[i] - percentile_25[i]
        lower_bound[i] = percentile_25[i] - (1.5 * iqr[i])
        upper_bound[i] = percentile_75[i] + (1.5 * iqr[i])
print(percentile_25)
print(percentile_75)
print(iqr)
print(lower_bound)
print(upper_bound)
print("\n")
for i in pdata:
    if pdata[i].dtypes == "int64":
        print("Count of lower bound outliers for column " + i + ": " + str(pdata[i][pdata[i] < lower_bound[i]].count()))
        print("Count of upper bound outliers for column " + i + ": " + str(pdata[i][pdata[i] > upper_bound[i]].count()))
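The dictionary-based IQR loop above can also be written with pandas quantiles in a few vectorized lines. A sketch on a toy numeric frame (column names illustrative, values synthetic):

```python
import pandas as pd

# Toy numeric frame; the notebook applies the same logic to pdata's int64 columns
num = pd.DataFrame({"balance": [0, 10, 20, 30, 40, 500],
                    "age": [25, 30, 35, 40, 45, 50]})

q1 = num.quantile(0.25)
q3 = num.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Per-column count of values outside the IQR fences
outliers = (num.lt(lower) | num.gt(upper)).sum()
print(outliers)
```

`DataFrame.lt`/`gt` broadcast the per-column bounds automatically, so no explicit loop over columns is needed.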
Observations:
#Step1: Find out how many nulls each column has:
print("Column names and the number of nulls they contain:")
print(pdata.isnull().sum()) # Number of nulls in each column of the dataframe
#pdata.isnull().values.any() # This is another way to check if there are any null values in data set
#Step2: Find out how many zeros are present in data.
print(pdata[:][pdata[:] == 0].count()) # Number of zeros in a column
#Step3: We already observed that balance and pdays have negatives. Find the count:
print("Count of negative values in 'balance'")
print(pdata["balance"][pdata["balance"] < 0].count()) # Number of negatives in a column
print("\nCount of negative values in 'pdays'")
print(pdata["pdays"][pdata["pdays"] < 0].count()) # Number of negatives in a column
#Step4: Find number of unique values in each column. Analyse outcome for categorical variables.
pdata.nunique() # Number of unique values in a column
Observations:
#Step5: Part1 Frequency analysis for categorical variables.
# value_counts gives the relative frequency of each value in a column
print(pdata['Target'].value_counts(normalize=True))
print('')
print(pdata['job'].value_counts(normalize=True))
print('')
print(pdata['marital'].value_counts(normalize=True))
print('')
print(pdata['education'].value_counts(normalize=True))
print('')
print(pdata['default'].value_counts(normalize=True))
print('')
print(pdata['housing'].value_counts(normalize=True))
print('')
print(pdata['loan'].value_counts(normalize=True))
print('')
print(pdata['contact'].value_counts(normalize=True))
print('')
print(pdata['month'].value_counts(normalize=True))
print('')
print(pdata['poutcome'].value_counts(normalize=True))
Observations:
Note: Further analysis will be performed once the categorical variables are transformed.
2. Multivariate Analysis:¶#Countplot for categorical variables.
for i in ['month', 'job', 'marital', 'education', 'housing', 'loan', 'contact', 'poutcome', 'default']:
    sns.countplot(data=pdata, y=i, hue='Target')
    plt.show()
Observations:
pdata.groupby(["Target"]).mean().T
pdata.groupby(["Target"]).median().T
Observations:
#sns.pairplot(pdata,diag_kind='kde')
pdata.corr().round(3)
# Plotting correlation to analyze how the target variable "Target" is correlated with the other variables
def plot_corr(df, size=12):
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation='vertical')
    plt.yticks(range(len(corr.columns)), corr.columns)
    for (i, j), z in np.ndenumerate(corr):
        ax.text(j, i, '{:0.3f}'.format(z), ha='center', va='center')

plot_corr(pdata)
Important Note: ¶The map above shows output only for the numeric variables, and therefore does not show the relationship of the target variable with the others. To understand how the "Target" variable is correlated with the other variables (assuming they are independent of each other), multivariate analysis is performed as follows. After that, the data will be transformed and some of the pending univariate analysis will be performed. Let's convert the columns with an 'object' datatype into categorical variables.
Deliverable – 2 (Prepare the data for analytics): ¶1. Ensure the attribute types are correct. If not, take appropriate actions.:¶#Step1: Loop through all columns in the dataframe. Replace strings with an integer.
pdata = pd.read_csv('bank-full.csv')
for feature in pdata.columns:                            # Loop through all columns in the dataframe
    if pdata[feature].dtype == 'object':                 # Only apply to columns with categorical strings
        pdata[feature] = pd.Categorical(pdata[feature])  # Convert strings to the category dtype (stored internally as integer codes)
pdata.head(10)
#Step2: Get the datatypes of columns in dataframe to ensure correct datatype conversion.
pdata.info()
Observation:¶It's worth noting that memory usage has dropped from 5.9+ MB to 2.8 MB after the above conversion.
#Try to answer: Can we reduce the number of categories in field 'job' before building a model? Analyse the field for that.
print(pdata['job'].value_counts(normalize=True))
print('')
Approach for Data Conversion:¶#Step1: Create dictionaries to enumerate the categories.
replaceStruct = {
"job": {"blue-collar": 1, "management": 2, "technician": 3, "services": 3,
"housemaid": 3, "admin.": 4, "retired": 5, "unemployed": 5,
"student": 5, "self-employed": 6, "entrepreneur": 6, "unknown": -1},
"education": {"primary": 1, "secondary":2 , "tertiary": 3, "unknown": -1},
"contact": {"cellular": 1, "telephone": 2,"unknown": -1},
"month": {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
"jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12},
"poutcome": {"success": 1, "failure":2 , "other": -1, "unknown": -1},
"Target": {"no": 0, "yes": 1},
"default": {"no": 0, "yes": 1},
"housing": {"no": 0, "yes": 1},
"loan": {"no": 0, "yes": 1}
}
#Step2: One hot encoding for marital status
oneHotCols=["marital"]
#Step3: Replace categories with the structures created above.
pdata=pdata.replace(replaceStruct)
pdata=pd.get_dummies(pdata, columns=oneHotCols)
pdata.head(10)
Let's continue with further EDA:
#Step1: Get the datatypes of columns in dataframe to ensure correct datatype conversion.
pdata.info()
#Re-running the median analysis grouped by Target. Mean analysis may not add value for categorical columns.
pdata.groupby(["Target"]).median().T
Observation:¶It's worth noting that memory usage has increased from 2.8 MB to 5.6 MB after the above conversion. All the columns are now non-null numeric columns. Nothing significant to add to the previous analysis.
To carry out further analysis and understand how the target variable is correlated with the other variables (assuming they are independent), the following heat map is plotted.
#Calculate % of Target for 1 vs 0.
n_true = len(pdata.loc[pdata['Target'] == 1])
n_false = len(pdata.loc[pdata['Target'] == 0])
print("Number of true cases: {0} ({1:2.2f}%)".format(n_true, (n_true / (n_true + n_false)) * 100 ))
print("Number of false cases: {0} ({1:2.2f}%)".format(n_false, (n_false / (n_true + n_false)) * 100))
Observations: 11.70% of the clients in the dataset subscribed to a term deposit.
#Step2: Plot a heatmap to understand how Target variable is correlated with other variables.
#sns.pairplot(pdata,diag_kind='kde')
pdata.corr().round(3)
# Plotting correlation to analyze how the target variable "Target" is correlated with the other variables
def plot_corr(df, size=18):
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation='vertical')
    plt.yticks(range(len(corr.columns)), corr.columns)
    for (i, j), z in np.ndenumerate(corr):
        ax.text(j, i, '{:0.3f}'.format(z), ha='center', va='center')
plot_corr(pdata)
Observations:¶
- 'age': 0.025, weak positive.
- 'job': 0.067, weak positive.
- 'education': 0.042, weak positive.
- 'default': -0.022, weak negative.
- 'balance': 0.053, weak positive.
- 'housing': -0.139, the strongest negative correlation.
- 'loan': -0.068, weak negative.
- 'contact': 0.143, fairly positive.
- 'day': -0.028, weak negative.
- 'month': 0.019, weak positive.
- 'duration': 0.395, the strongest correlation of all the variables. This also contradicts the earlier statement that this variable is not correlated with the target.
- 'campaign': -0.073, weak negative.
- 'pdays': 0.104, fairly positive.
- 'previous': 0.093, fairly positive.
- 'poutcome': 0.122, fairly positive.
- 'marital_divorced': 0.003, very weak positive.
- 'marital_married': -0.060, weak negative.
- 'marital_single': 0.064, weak positive.

The following pairs of attributes have similar correlation with the target variable.
Note: pdays, previous, and poutcome are highly correlated with each other. This may affect logistic regression because of multicollinearity.
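Multicollinearity can be quantified with variance inflation factors; a handy identity is that the diagonal of the inverse correlation matrix gives each variable's VIF. A numpy sketch on synthetic data (two deliberately correlated features, one independent):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # strongly correlated with x1 (like pdays vs. previous)
x3 = rng.normal(size=n)              # independent feature

X = np.column_stack([x1, x2, x3])
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))   # VIF_i = [inv(R)]_ii = 1 / (1 - R_i^2)
print(vif.round(2))
```

A common rule of thumb treats VIF above 5–10 as problematic; applied to pdata, this would flag the pdays/previous/poutcome cluster noted above.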
4. Create the training set and test set in a ratio of 70:30: ¶Approach:¶Target variable: Target
#Perform imputation
#from sklearn.impute import SimpleImputer #Import library if not already imported.
##Imputation of 'balance'
#Assumption: a balance of 0 is treated as a missing value and imputed with the mean:
#Step1: Find the list of unique values in the 'balance' column where the value is 0.
n = pdata[pdata['balance'] == 0].balance.unique()
#Step2: Replace every 0 value with mean
rep_0 = SimpleImputer(missing_values=0, strategy="mean")
cols=['balance']
imputer = rep_0.fit(pdata[cols])
pdata[cols] = imputer.transform(pdata[cols])
pdata.nunique()
print(pdata[:][pdata[:] == 0].count()) # Number of zeros in a column
#Splitting Data to create training and testing datasets
#from sklearn.model_selection import train_test_split #Import library if not already imported.
#Drop target variable and form X
X = pdata.drop(['Target'],axis=1)
#All columns after imputation except the target variable, i.e. Target
#Only include target variable and form Y
Y = pdata['Target']
#Target variable, i.e. term deposit subscription (1=yes, 0=no)
#Split original data to form 70% training and 30% testing data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
#1 is just an arbitrary random seed
#Check the volume of data available in training and testing dataset.
print("{0:0.2f}% data is in training set".format((len(x_train)/len(pdata.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test)/len(pdata.index)) * 100))
#Perform frequency analysis for 'Target' (values 0 and 1) in the original, training, and testing data
#Original
print("Original Target True Values : {0} ({1:0.2f}%)"
.format(len(pdata.loc[pdata['Target'] == 1])
, (len(pdata.loc[pdata['Target'] == 1])/len(pdata.index)) * 100))
print("Original Target False Values : {0} ({1:0.2f}%)"
.format(len(pdata.loc[pdata['Target'] == 0])
, (len(pdata.loc[pdata['Target'] == 0])/len(pdata.index)) * 100))
print("")
#Training
print("Training Target True Values : {0} ({1:0.2f}%)"
.format(len(y_train[y_train[:] == 1])
, (len(y_train[y_train[:] == 1])/len(y_train)) * 100))
print("Training Target False Values : {0} ({1:0.2f}%)"
.format(len(y_train[y_train[:] == 0])
, (len(y_train[y_train[:] == 0])/len(y_train)) * 100))
print("")
#Testing
print("Test Target True Values : {0} ({1:0.2f}%)"
.format(len(y_test[y_test[:] == 1])
, (len(y_test[y_test[:] == 1])/len(y_test)) * 100))
print("Test Target False Values : {0} ({1:0.2f}%)"
.format(len(y_test[y_test[:] == 0])
, (len(y_test[y_test[:] == 0])/len(y_test)) * 100))
print("")
Insight: True and False values appear properly balanced across the training and testing datasets.
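The balance above depends on the luck of random_state; passing stratify=Y to train_test_split guarantees it. A sketch with synthetic labels mirroring the ~11.7% positive rate (demo names X_demo/y_demo are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.117).astype(int)  # ~11.7% positives, like Target

# stratify=y_demo keeps the class ratio (nearly) identical in both splits
xtr, xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)
print(ytr.mean().round(3), yte.mean().round(3))
```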
Deliverable – 3 (Create The Ensemble Model) ¶1.a. Create models using Logistic Regression. Note the model performance. ¶#Step1: Model Selection
train_score=[]
test_score=[]
solver = ['newton-cg','lbfgs','liblinear','sag','saga']
for i in solver:
    model = LogisticRegression(random_state=1, penalty='l2', solver=i, max_iter=10000)  # changing values of solver
    model.fit(x_train, y_train)
    y_predict = model.predict(x_test)
    train_score.append(round(model.score(x_train, y_train), 3))
    test_score.append(round(model.score(x_test, y_test), 3))
print(solver)
print()
print(train_score)
print()
print(test_score)
Insight: It appears that all models have good train and test scores. But 'newton-cg' and 'lbfgs' have the best scores for penalty = 'l2'.
#Step2: Perform same steps as above, but keep penalty = 'l1'.
#In this case, only those solvers which work with 'l1' are considered.
train_score=[]
test_score=[]
solver = ['liblinear','saga'] # changing values of solver which works with 'l1'
for i in solver:
    model = LogisticRegression(random_state=42, penalty='l1', solver=i, max_iter=10000)  # changed penalty to 'l1'
    model.fit(x_train, y_train)
    y_predict = model.predict(x_test)
    train_score.append(round(model.score(x_train, y_train), 3))
    test_score.append(round(model.score(x_test, y_test), 3))
print(solver)
print()
print(train_score)
print()
print(test_score)
Insight: It appears that 'liblinear' with 'l1' has the same scores as 'newton-cg' and 'lbfgs' with l2. Therefore, perform additional analysis on these three models with different values of C.
#Analysis1: Analysis for liblinear with l1
print("solver = liblinear, penalty = l1:")
train_score=[]
test_score=[]
C = [0.01,0.1,0.25,0.5,0.75,1]
for i in C:
    model = LogisticRegression(random_state=42, penalty='l1', solver='liblinear', max_iter=10000, C=i)
    model.fit(x_train, y_train)
    y_predict = model.predict(x_test)
    train_score.append(round(model.score(x_train, y_train), 3))  # append training accuracy for every run of the loop
    test_score.append(round(model.score(x_test, y_test), 3))     # append testing accuracy for every run of the loop
print(C)
print(train_score)
print(test_score)
#Analysis2: Analysis for liblinear with l2
print("\nsolver = liblinear, penalty = l2:")
train_score=[]
test_score=[]
C = [0.01,0.1,0.25,0.5,0.75,1]
for i in C:
    model = LogisticRegression(random_state=42, penalty='l2', solver='liblinear', max_iter=10000, C=i)
    model.fit(x_train, y_train)
    y_predict = model.predict(x_test)
    train_score.append(round(model.score(x_train, y_train), 3))  # append training accuracy for every run of the loop
    test_score.append(round(model.score(x_test, y_test), 3))     # append testing accuracy for every run of the loop
print(C)
print(train_score)
print(test_score)
#Analysis3: Analysis for newton-cg with l2
print("\nsolver = newton-cg, penalty = l2:")
train_score=[]
test_score=[]
C = [0.01,0.1,0.25,0.5,0.75,1]
for i in C:
    model = LogisticRegression(random_state=42, penalty='l2', solver='newton-cg', max_iter=10000, C=i)
    model.fit(x_train, y_train)
    y_predict = model.predict(x_test)
    train_score.append(round(model.score(x_train, y_train), 3))  # append training accuracy for every run of the loop
    test_score.append(round(model.score(x_test, y_test), 3))     # append testing accuracy for every run of the loop
print(C)
print(train_score)
print(test_score)
Conclusion on Model Selection:¶Based on all the above observations, the model chosen is as follows:
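The three manual solver/penalty/C loops above can be condensed into a single cross-validated search. A sketch on synthetic data (the parameter grid mirrors the values tried above; X_demo/y_demo are stand-ins for x_train/y_train):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

# Each dict pairs a solver only with the penalties it supports
param_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.01, 0.1, 0.25, 0.5, 1]},
    {"solver": ["newton-cg"], "penalty": ["l2"], "C": [0.01, 0.1, 0.25, 0.5, 1]},
]
search = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

GridSearchCV also scores on held-out folds rather than the single test set, which gives a less optimistic estimate than the loops above.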
#Train and test data using the model of our choice:
# Fit the model on train
model = LogisticRegression(random_state=42,penalty='l1',solver='liblinear',max_iter=10000,C = 1)
model.fit(x_train, y_train)
#predict on test
y_predict = model.predict(x_test)
coef_df = pd.DataFrame(model.coef_)
coef_df['intercept'] = model.intercept_
print(coef_df)
#Print confusion matrix:
cm = metrics.confusion_matrix(y_test, y_predict, labels=[1, 0])
df_cm = pd.DataFrame(cm, index=["Actual 1", "Actual 0"],
                     columns=["Predict 1", "Predict 0"])
plt.figure(figsize=(3, 3))
sns.heatmap(df_cm, annot=True, fmt='g')
Important Note: Since we are going to evaluate many models in this exercise, we create a data frame "model_scores" to compare model performance in one go.
#Since we are going to evaluate many models in this exercise, creating a data frame to evaluate model performance at one go
model_scores = pd.DataFrame(index=['Training','Testing','Recall','Precision','F1 Score','Roc Auc Score'],
columns=['Logistic Regression','Decision Tree','Bagging','AdaBoosting','GradientBoost','RandomForest'])
model_scores['Logistic Regression'] = [model.score(x_train, y_train), model.score(x_test, y_test)
, recall_score(y_test,y_predict), precision_score(y_test,y_predict)
, f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]
model_scores
Insight: All the values are on the lower side, indicating that there is scope for improvement in this model. Recall is comparatively lower, but low recall may be acceptable here, because the outcome of this model might be used for targeted campaigning. If so, customers who fall in the True-Positive, False-Positive, and False-Negative cells are all potential customers who may subscribe to a term deposit. The business can still benefit from the False-Positive and False-Negative cases, and True-Negatives can be excluded from the campaign.
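If higher recall were preferred for the campaign, it can also be tuned after fitting by lowering the 0.5 decision threshold on predict_proba. A sketch on synthetic imbalanced data (names clf/X_demo/y_demo are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Imbalanced synthetic data: ~88% negatives, roughly like the bank target
X_demo, y_demo = make_classification(n_samples=600, weights=[0.88], random_state=1)
clf = LogisticRegression(max_iter=10000).fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)[:, 1]     # P(class = 1)
pred_default = (proba >= 0.5).astype(int)   # the usual cut-off
pred_low = (proba >= 0.3).astype(int)       # lower threshold -> more predicted positives

# Lowering the threshold can only keep or raise recall (usually at a precision cost)
print(recall_score(y_demo, pred_default), recall_score(y_demo, pred_low))
```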
1.b. Create models using the Decision Tree algorithm. Note the model performance ¶#Create a decision tree with default parameters except for criterion and random_state
dTree = DecisionTreeClassifier(criterion = 'entropy', random_state=1)
dTree.fit(x_train, y_train)
Insight: criterion = 'entropy' showed slightly better results than 'gini'
print(dTree.score(x_train, y_train))
print(dTree.score(x_test, y_test))
#predict on test
y_predict = dTree.predict(x_test)
model_scores['Decision Tree_Overfit'] = [dTree.score(x_train, y_train), dTree.score(x_test, y_test)
, recall_score(y_test,y_predict), precision_score(y_test,y_predict)
, f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]
model_scores
Observation: The model looks overfit, as expected
train_char_label = ['No', 'Yes']
Credit_Tree_File = open('credit_tree.dot','w')
dot_data = tree.export_graphviz(dTree, out_file=Credit_Tree_File, feature_names = list(x_train)
,class_names = list(train_char_label))
Credit_Tree_File.close()
retCode = system("dot -Tpng credit_tree.dot -o credit_tree.png")
if retCode > 0:
    print("system command returning error: " + str(retCode))
else:
    display(Image("credit_tree.png"))
Observation: Very deep tree. Save the image above and open it in an image viewer for better clarity.
dTreeR = DecisionTreeClassifier(criterion = 'gini', max_depth = 8, random_state=1
,min_samples_leaf = 10, min_samples_split = 2, splitter = "best")
dTreeR.fit(x_train, y_train)
print(dTreeR.score(x_train, y_train))
print(dTreeR.score(x_test, y_test))
Insight: At max_depth = 8, testing score is highest. The values of parameters in above DecisionTreeClassifier are fixed after several trials and errors.
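The "several trials and errors" above can be made systematic by sweeping max_depth and recording both scores; the depth where the test score peaks is the pruning point. A sketch on synthetic data (X_demo/y_demo stand in for the bank features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=800, n_features=10, random_state=1)
xtr, xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# Record (train, test) accuracy for each candidate depth
scores = {}
for depth in range(2, 12):
    dt = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=10, random_state=1)
    dt.fit(xtr, ytr)
    scores[depth] = (round(dt.score(xtr, ytr), 3), round(dt.score(xte, yte), 3))

for depth, (tr, te) in scores.items():
    print(depth, tr, te)  # a widening train/test gap at higher depths signals overfitting
```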
train_char_label = ['No', 'Yes']
Credit_Tree_FileR = open('credit_treeR.dot','w')
dot_data = tree.export_graphviz(dTreeR, out_file=Credit_Tree_FileR, feature_names = list(x_train)
, class_names = list(train_char_label))
Credit_Tree_FileR.close()
#Works only if the "dot" command works on your machine
retCode = system("dot -Tpng credit_treeR.dot -o credit_treeR.png")
if retCode > 0:
    print("system command returning error: " + str(retCode))
else:
    display(Image("credit_treeR.png"))
Note: Save the image above and open it in an image viewer for better clarity.
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print (pd.DataFrame(dTreeR.feature_importances_, columns = ["Imp"], index = x_train.columns))
Note: As observed in the EDA (heat map), duration is the most important influencer among all the attributes.
y_predict = dTreeR.predict(x_test)
cm=metrics.confusion_matrix(y_test, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
#Print performance indicators for the model
model_scores['Decision Tree'] = [dTreeR.score(x_train, y_train), dTreeR.score(x_test, y_test)
, recall_score(y_test,y_predict), precision_score(y_test,y_predict)
, f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]
model_scores
Insight: Overall performance has improved compared with logistic regression.
2. Build the ensemble models ¶Ensemble Learning - Bagging¶from sklearn.ensemble import BaggingClassifier
bgcl = BaggingClassifier(base_estimator=dTree, n_estimators=50,random_state=1,max_samples= .74)
#bgcl = BaggingClassifier(n_estimators=50,random_state=1)
bgcl = bgcl.fit(x_train, y_train)
y_predict = bgcl.predict(x_test)
print(bgcl.score(x_test , y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
#Print performance indicators for the model
model_scores['Bagging'] = [bgcl.score(x_train, y_train), bgcl.score(x_test, y_test)
, recall_score(y_test,y_predict), precision_score(y_test,y_predict)
, f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]
model_scores
Ensemble Learning - AdaBoosting¶from sklearn.ensemble import AdaBoostClassifier
abcl = AdaBoostClassifier(n_estimators=90, random_state=1)
#abcl = AdaBoostClassifier( n_estimators=50,random_state=1)
abcl = abcl.fit(x_train, y_train)
Note: n_estimators = 90 gives the highest model score among the values tried.
y_predict = abcl.predict(x_test)
print(abcl.score(x_test , y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
#Print performance indicators for the model
model_scores['AdaBoosting'] = [abcl.score(x_train, y_train), abcl.score(x_test, y_test)
, recall_score(y_test,y_predict), precision_score(y_test,y_predict)
, f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]
model_scores
Ensemble Learning - GradientBoost¶from sklearn.ensemble import GradientBoostingClassifier
gbcl = GradientBoostingClassifier(n_estimators = 92,random_state=1)
gbcl = gbcl.fit(x_train, y_train)
Note: n_estimators = 92 gives the highest model score among the values tried.
y_predict = gbcl.predict(x_test)
print(gbcl.score(x_test, y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
#Print performance indicators for the model
model_scores['GradientBoost'] = [gbcl.score(x_train, y_train), gbcl.score(x_test, y_test)
, recall_score(y_test,y_predict), precision_score(y_test,y_predict)
, f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]
model_scores
Ensemble RandomForest Classifier¶from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier(n_estimators = 93, random_state=1,max_features=10)
rfcl = rfcl.fit(x_train, y_train)
Note: n_estimators = 93 gives the highest model score among the values tried.
y_predict = rfcl.predict(x_test)
print(rfcl.score(x_test, y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
#Print performance indicators for the model
model_scores['RandomForest'] = [rfcl.score(x_train, y_train), rfcl.score(x_test, y_test)
, recall_score(y_test,y_predict), precision_score(y_test,y_predict)
, f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]
model_scores
3. Make a DataFrame to compare models and their metrics. ¶The data frame model_scores is already prepared above.
model_scores
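With metrics as rows and models as columns, idxmax(axis=1) pulls out the best model for each metric directly from the comparison frame. A sketch on a toy stand-in for model_scores (the numbers below are hypothetical, for illustration only):

```python
import pandas as pd

# Toy stand-in for model_scores: metrics as rows, models as columns (hypothetical values)
model_scores_demo = pd.DataFrame(
    {"Logistic Regression": [0.90, 0.89, 0.35, 0.65],
     "Decision Tree":       [0.91, 0.90, 0.42, 0.70],
     "GradientBoost":       [0.92, 0.91, 0.48, 0.66]},
    index=["Training", "Testing", "Recall", "Precision"],
)

# Best model per metric; applied to the real model_scores it supports the conclusion below
best_per_metric = model_scores_demo.idxmax(axis=1)
print(best_per_metric)
```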
Observations:¶Conclusion:¶